Cognizant BFS Innovations – rapidprototypes

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations:

Harshawardhan Nene, BFS Innovations, harshawardhan.nene@cognizant.com [PRIMARY CONTACT]

Tool(s):

MS Excel was used for quick preliminary visualizations.

TGraph2010, a turtle graphics program written in Java, was used to try out a few ideas (http://sourceforge.net/projects/turtlegraph/).

This tool had no documentation and a lot of limitations for the task at hand. Nevertheless, it helped me to quickly try out a bunch of ideas.

The final visualizations were generated using Processing (http://processing.org/), an open source "programming language and IDE built for the electronic arts and visual design communities".

Processing is an impressive tool for creating visualizations and has good online documentation along with a decent set of tutorials. However, being new to the tool, I had little time to use any of its advanced features.

Thus, most of the data was manipulated in Java using the Netbeans 6.8 IDE (http://netbeans.org/). The output was a set of basic drawing instructions in Processing which would then generate the visualization.
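The split described above (Java prepares the data, Processing only draws) can be illustrated with a minimal sketch. This is not the code used for the submission: the class `ProcessingEmitter`, the method `emit`, and the left-to-right layout are hypothetical. Only `size`, `background`, `fill` and `rect` are standard Processing calls.

```java
import java.util.List;

// Hypothetical helper: turns precomputed streak lengths into Processing source text.
class ProcessingEmitter {

    // Emits one filled square per streak, laid out left to right on a common baseline.
    // Assumption: the side of each square equals the streak length in pixels,
    // and `height` is at least as large as the longest streak.
    static String emit(List<Integer> streaks, int height) {
        int width = 0;
        for (int s : streaks) {
            width += s;
        }
        StringBuilder sb = new StringBuilder();
        sb.append("size(").append(width).append(", ").append(height).append(");\n");
        sb.append("background(255);\n");
        sb.append("fill(0);\n");
        int x = 0;
        for (int s : streaks) {
            // Processing's rect(x, y, w, h) takes the top-left corner, width and height.
            sb.append("rect(").append(x).append(", ").append(height - s)
              .append(", ").append(s).append(", ").append(s).append(");\n");
            x += s;
        }
        return sb.toString();
    }
}
```

The returned text can be pasted into the Processing IDE as a static sketch, which keeps all data handling on the Java side.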

Inkscape (http://www.inkscape.org/) was used for adding captions to the images. The visualizations themselves are generated entirely automatically; only the explanatory text captions were added manually.

 

Video:

 

POWERPOINT

 

 

ANSWERS:


MC3.1: What is the region or country of origin for the current outbreak?  Please provide your answer as the name of the native viral strain along with a brief explanation.

Answer: Nigeria_B. The skyline visualization takes the base with maximum occurrences at each column in the 58 current-outbreak strains. The resulting sequence is then compared to each native sequence. Longer streaks of matching nucleotides result in larger squares. A mutation is marked by a white line and a new square starts growing with the next streak. The white lines help identify the concentration of mutations in a region. Thus, fewer and larger squares in a skyline indicate fewer mutations. The length of each skyline is 1404 pixels, corresponding to the 1404 nucleotides. This retains positional information and allows comparisons between strains at any given column. The figure at the bottom left of each square indicates the length of the streak.
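The two computations behind the skyline (the per-column majority base, then the streaks of matches against a native sequence) can be sketched as follows. This is an illustrative reconstruction, not the submission's code: the class and method names are hypothetical, and the tie-breaking rule (first of A, C, G, T) is an assumption, since the text does not say how ties were handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the skyline data preparation.
class Skyline {

    // Most frequent base in each column across equal-length strains.
    // Assumption: ties go to the first base in the order A, C, G, T.
    static String consensus(List<String> strains) {
        String bases = "ACGT";
        int n = strains.get(0).length();
        StringBuilder sb = new StringBuilder(n);
        for (int col = 0; col < n; col++) {
            int[] counts = new int[4];
            for (String s : strains) {
                int idx = bases.indexOf(s.charAt(col));
                if (idx >= 0) {
                    counts[idx]++;
                }
            }
            int best = 0;
            for (int b = 1; b < 4; b++) {
                if (counts[b] > counts[best]) {
                    best = b;
                }
            }
            sb.append(bases.charAt(best));
        }
        return sb.toString();
    }

    // Lengths of consecutive runs of matching positions; every mismatch
    // (a mutation, drawn as a white line) ends the current run.
    static List<Integer> streaks(String a, String b) {
        List<Integer> out = new ArrayList<>();
        int run = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                run++;
            } else {
                if (run > 0) {
                    out.add(run);
                }
                run = 0;
            }
        }
        if (run > 0) {
            out.add(run);
        }
        return out;
    }
}
```

Calling `streaks(consensus(outbreak), nativeSequence)` for each native strain gives one list per skyline; the native strain with the fewest and longest streaks is the closest match.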

VISUALIZATION


MC3.2:  Over time, the virus spreads and the diversity of the virus increases as it mutates.  Two patients infected with the Drafa virus are in the same hospital as Nicolai.  Nicolai has a strain identified by sequence 583.  One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51.  Assume only a single viral strain is in each patient.  Which patient likely contracted the illness from Nicolai and why?  Please provide your answer as the sequence number along with a brief explanation.

Answer: 123. Using the skyline view to compare sequences 123 and 51 with Nicolai’s strain, we observe that sequence 123 is closer to Nicolai’s. Strain 51 has two additional mutations (at columns 842 and 946). The nucleotides observed at these columns are more evenly distributed than at most other columns (T41, C17 at 842; A44, T14 at 946). Having different nucleotides at these columns further reduces the likelihood that the patient with strain 51 acquired the illness from Nicolai.
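The same comparison can be expressed without the drawing step: list the columns where each candidate differs from Nicolai's strain, then count how the outbreak strains split at those columns. The sketch below is a hypothetical illustration (the names `StrainDiff`, `mismatchColumns` and `countAt` are not from the submission) and uses 1-based columns to match the positions quoted in the answer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the pairwise comparison used to judge closeness.
class StrainDiff {

    // 1-based columns at which two sequences differ
    // (compared up to the shorter length).
    static List<Integer> mismatchColumns(String reference, String candidate) {
        List<Integer> cols = new ArrayList<>();
        int n = Math.min(reference.length(), candidate.length());
        for (int i = 0; i < n; i++) {
            if (reference.charAt(i) != candidate.charAt(i)) {
                cols.add(i + 1);
            }
        }
        return cols;
    }

    // Number of strains carrying `base` at the given 1-based column.
    static int countAt(List<String> strains, int column, char base) {
        int count = 0;
        for (String s : strains) {
            if (s.charAt(column - 1) == base) {
                count++;
            }
        }
        return count;
    }
}
```

The candidate with the shorter mismatch list is the closer strain, and `countAt` gives the kind of per-column split quoted above (for example T41, C17 at column 842).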

VISUALIZATION


MC3.3:  Signs and symptoms of the Drafa virus are varied and humans react differently to infection.  Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them. 

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic).  The mutations involve one or more base substitutions.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39)

Answer: A→C 269

            G→C 212

            A→T 946

Nearly 95% of the columns have the same nucleotide across the 58 strains. Only columns showing a distribution of at least 95% / 5% were considered for this question and the next. These columns have been arranged in order, with the extreme left column (161) having a nearly even distribution and the extreme right column (821) having the most skewed distribution. The larger groups appear in the top row and the smaller groups in the bottom row. We assume that the smaller groups are the mutations. Each characteristic was simplified to: severe / not severe. The percentage of severe characteristics was calculated for each group in each column. The higher of the two percentages for a given characteristic is highlighted red. Bars have been used to show the absolute difference between the percentages. The brighter blue bars are the top 3 mutations.
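The per-column calculation described above can be sketched as follows, assuming each strain has already been reduced to a severe / not severe flag for one characteristic. The class `SeverityGap` and its methods are hypothetical names for illustration, not the submission's code.

```java
// Hypothetical sketch: percentage of severe cases per nucleotide group in one column.
class SeverityGap {

    // Percentage (0 to 100) of strains flagged severe among those that carry
    // `base` at the given 1-based column. Returns 0 if no strain carries it.
    static double severePercent(String[] strains, boolean[] severe, int column, char base) {
        int total = 0;
        int hits = 0;
        for (int i = 0; i < strains.length; i++) {
            if (strains[i].charAt(column - 1) == base) {
                total++;
                if (severe[i]) {
                    hits++;
                }
            }
        }
        return total == 0 ? 0.0 : 100.0 * hits / total;
    }

    // Absolute difference between the larger (majority) group and the
    // smaller (assumed mutant) group, i.e. the length of the bar.
    static double gap(String[] strains, boolean[] severe, int column, char major, char minor) {
        return Math.abs(severePercent(strains, severe, column, minor)
                - severePercent(strains, severe, column, major));
    }
}
```

Ranking the columns by `gap` for the symptom characteristic, restricted to columns where the smaller group has the higher percentage, would reproduce the ordering that the bars show.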

VISUALIZATION


MC3.4:  Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question.  To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important.  Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions.  In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).

Answer: A→C 269

            G→C 223

            A→T 946

For each column, the absolute differences are summed over the characteristics marked red. E.g., the chance of having severe symptoms is 88% higher if column 269 has a C instead of a T. At column 269 the severity is higher for symptoms, mortality, drug resistance and vulnerability, so 82+88+36+30 gives us 236. The top 3 sums are highlighted and answer the question.
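The scoring step amounts to a filtered sum. The sketch below is one reading of it, under the assumption that only characteristics where the mutant (smaller) group shows the higher percentage contribute to a column's score; the names `DangerScore`, `score` and `mutantHigher` are hypothetical.

```java
// Hypothetical sketch of the per-column danger score.
class DangerScore {

    // Sums the absolute percentage differences over the characteristics
    // for which the mutant group has the higher share of severe cases.
    static double score(double[] differences, boolean[] mutantHigher) {
        double sum = 0.0;
        for (int i = 0; i < differences.length; i++) {
            if (mutantHigher[i]) {
                sum += differences[i];
            }
        }
        return sum;
    }
}
```

For column 269, the four contributing differences 82, 88, 36 and 30 give a score of 236, matching the figure quoted above.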

VISUALIZATION